Week 03: Parsing & Top-Down Parsing

What is parsing?

Goal: To derive the parse tree of program in grammar.
Input: Sequence of tokens from lexer
Output: Parse tree of the program.
Sample
- Input: IF ID = ID THEN INT ELSE INT FI
- Output: (IF-THEN-ELSE (= ID ID) (INT) (INT))

Context-Free Grammars

Goal: To separate the token string into many pieces such that every piece is acceptable in grammar.
- Context-Free means every two pieces are not related.
- Parser must distinguish between valid and invalid strings of tokens under CFGs.
What we need
- A language for describing valid strings of tokens.
- A method for distinguishing valid under the language.
Note: Programming languages always have recursive structures.
- An EXPR is ... (recursive)
  - if EXPR then EXPR else EXPR fi
- Context-free grammars are a natural notation for this recursive structure.
Def. A CFG consists of
- A set of terminals $T$
- A set of non-terminals $N$
- A start symbol $S \in N$
- A set of productions $X \rightarrow Y_1, \dots, Y_N, X\in N, Y_i \in N\cup T\cup \{ \epsilon\}$
  - Ex: $S\rightarrow (S)$
  - How the production works?
    - Begin with a string with only the start symbol $S$
    - Replace any non-terminal $X$ in the string by the right-hand side of some production $X \rightarrow Y_1\dots Y_n$
    - Repeat (2) until there are no non-terminals
      - $X_1\dots X_i X X_{i+1}\dots X_N \rightarrow X_1\dots X_i Y_1\dots Y_k X_{i+1}\dots X_N,\ if\ X\rightarrow Y_1\dots Y_k$
      - $S \rightarrow \dots \rightarrow \alpha_0 \rightarrow \alpha_1 \rightarrow \dots \rightarrow \alpha_n, if\ \alpha_0 \rightarrow^* \alpha_n (\geq 0\ steps)$
Def. (Language) Use grammar to define language.
- Let $G$ be a context-free grammar with start symbol $S$. Then the language $L(G)$ of $G$ is $$\{ a_1\dots a_n\ |\ \forall i\ (a_i\in T)\wedge (S\rightarrow^* a_1\dots a_n) \}$$
- Terminals ought to be tokens of the language.
- Membership in a language is yes or no; also need parse tree of the input.
- Must handle errors gracefully.
- Need an implementation of CFG’s (e.g., bison)
- Form of the grammar is important.
  - Many grammars generate the same language.
  - Tools are sensitive to the grammar.

Derivations

Goal: To define the process from the start symbol to some valid token string.
- Def. A derivation is a sequence of productions applying on a start symbol $S$ to sequences without any non-terminal.
- We can draw a derivation as a tree
  - Start symbol is the tree’s root
  - For a production $X \rightarrow Y_1\dots Y_n$ add children $Y_1\dots Y_n$ to node $X$.
- Sample
  - $G$: $E\rightarrow E+E\ |\ E*E\ |\ (E)\ |\ id$
  - Target: $id*id+id$
  - Derivation: $E \rightarrow E+E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
  - Draw derivation as a parse tree: $(((id) * (id)) + (id))$
  - The example is a left-most derivation. At each step, replace the left-most non-terminal.
  - Note that right-most and left-most derivations have the same parse tree.
A parse tree has
- Terminals at the leaves
- Non-terminals at the interior nodes
An in-order traversal of the leaves is the original input.
We are not just interested in whether $s \in L(G)$.
- We need a parse tree for $s$.
The relation between derivation and parse tree are many to one.
- ((id)+(id))

Ambiguity

Goal: A parse tree is used to describe how we explain the program string. If there are two parse trees for the program, it has two meanings.
Sample
- $G$: $E\rightarrow E+E\ |\ E*E\ |\ (E)\ |\ id$
- Target: $id*id+id$
- This string has two parse trees
  - Derivation: $E \rightarrow E+E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
  - Derivation: $E \rightarrow E*E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
Def. A grammar is ambiguous, if it has more than one parse tree for some string $s\in L(G)$.
- A lot of ways to solve it.
- Most direct method is to rewrite grammar unambiguously.
- Like, enforces precedence of * over +.
- Impossible to convert automatically an ambiguous grammar to an unambiguous one.
- Most tools allow precedence and associativity declarations to disambiguate grammars.

Error

Goal
- To detect non-valid programs
- To translate the valid ones
- Sample
  - (Lexer) Errors in Lexical: ...$...
  - (Parser) Errors in Syntax: ...x *%...
  - (Type checker) Errors in Semantic: ...int x; y=x(3);...
  - (Tester/User) Errors in Correctness: program

Error Handler

Goal
- Report errors accurately and clearly
- Recover from an error quickly
- Not slow down compilation of valid code
What to do
1. Panic mode
  - Discard tokens until one with a clear role is found. And continue.
  - Bison: use the special terminal error to describe how much input to skip.
2. Error Productions
  - Specify known common mistakes in the grammar, like 5x and 5*x...
  - Complicates the grammar.
3. Automatic local or global correction
  - Try token insertions and deletions. (Edit distance)
  - Exhaustive search.

Abstract Syntax Trees (AST)

Goal: Represent the parse tree and ignore the details.
Form: (op (op1) (op2),...,(opn))
- (IF-THEN-ELSE (< (x) (5)) (= (y) (4)) (= (y) (3)))

Top-Down Parsing

Recursive Descent Parsing
- The parse tree is constructed
  - From the top
  - From left to right
- Use DFS to match the token string
  - Let the global pointer next point to the next input token: TOKEN *next = array;
  - Try to match a given token terminal: bool term(TOKEN tok) { return *next++ == tok; }
  - Try to mathc the $n^{th}$ production of S: bool Sn() { ... }
  - For production $E\rightarrow T$: bool E1() { return T(); }
  - For production $E\rightarrow T+E$: bool E2() { return T() && term(PLUS) && E(); }
  - For all productions of E (with backtracking) bool E() { TOKEN *save = next; return (next = save, E1()) || (next = save, E2()); }
Error in Recursive Descent
- Scenario: $S \rightarrow Sa$
  - It goes into an infinite loop.
- RD does not work in left-recursive grammar.
  - Def. A left-recursive grammar has a non-terminal $S$, $S\rightarrow^+ S\alpha$, for some $\alpha$
  - Rewrite all production rules to use right-recursion.